Although only about 15 % of all waste within the EU is generated as municipal waste[1], the absolute figures pose a major problem for municipalities, waste management companies and the environment. 225.7 million tonnes of municipal waste were collected in the EU in 2020, of which only 68 million tonnes were recycled directly; the remainder went to long-term landfill or was incinerated for energy generation. In view of the climate-damaging landfill gases produced during storage, the CO2 emissions during incineration and the large amount of space required, the EU aims to continuously optimise its waste management, promoting the production of less waste, a stronger circular economy and the economic efficiency of waste management. As part of this optimisation, we want to establish a status quo of municipal waste management in Italian municipalities on which subsequent optimisation projects can build. For this purpose, we base our work on a data set covering the waste management of 4341 Italian municipalities. Using these data, we draw up profiles of the municipalities and cluster them with regard to their descriptive characteristics, in particular the key figures of waste management, but also geographical and economic factors.
Get an overview of the data set.
wm_df <- load2("data/waste_management.RData")
skimr::skim(wm_df)
| Name | wm_df |
| Number of rows | 4341 |
| Number of columns | 36 |
| _______________________ | |
| Column type frequency: | |
| character | 5 |
| numeric | 31 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Region | 0 | 1 | 5 | 21 | 0 | 20 | 0 |
| Provinz | 0 | 1 | 4 | 21 | 0 | 102 | 0 |
| Gemeinde | 6 | 1 | 2 | 60 | 0 | 4333 | 0 |
| Gebuehrenregelung | 0 | 1 | 4 | 8 | 0 | 2 | 0 |
| Region_PAYT | 0 | 1 | 2 | 4 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| ID | 0 | 1.00 | 47469.93 | 30089.80 | 1272.00 | 18135.00 | 42015.00 | 70049.00 | 111107.00 | ▇▃▅▃▃ |
| Flaeche | 6 | 1.00 | 41.00 | 56.78 | 0.12 | 10.85 | 22.73 | 47.49 | 1287.39 | ▇▁▁▁▁ |
| Bevoelkerung | 0 | 1.00 | 10203.84 | 53426.40 | 34.00 | 1579.00 | 3535.00 | 8199.00 | 2617175.00 | ▇▁▁▁▁ |
| Bevoelkerungsdichte | 6 | 1.00 | 405.05 | 771.21 | 2.48 | 62.59 | 151.32 | 399.36 | 12122.83 | ▇▁▁▁▁ |
| Strassen | 443 | 0.90 | 101.93 | 309.99 | 1.00 | 25.00 | 51.00 | 105.00 | 14970.00 | ▇▁▁▁▁ |
| Inselgemeinde | 6 | 1.00 | 0.01 | 0.07 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| Kuestengemeinde | 6 | 1.00 | 0.17 | 0.37 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
| Urbanisierungsgrad | 6 | 1.00 | 2.49 | 0.59 | 1.00 | 2.00 | 3.00 | 3.00 | 3.00 | ▁▁▆▁▇ |
| Geologischer_Indikator | 285 | 0.93 | 2.29 | 0.89 | 1.00 | 1.00 | 3.00 | 3.00 | 3.00 | ▃▁▂▁▇ |
| Abfaelle_gesamt | 0 | 1.00 | 5.31 | 32.54 | 0.02 | 0.61 | 1.52 | 3.95 | 1691.89 | ▇▁▁▁▁ |
| Abfaelle_sortiert | 0 | 1.00 | 3.25 | 15.62 | 0.00 | 0.37 | 1.04 | 2.73 | 765.13 | ▇▁▁▁▁ |
| Abfaelle_unsortiert | 0 | 1.00 | 2.04 | 17.64 | 0.01 | 0.18 | 0.41 | 1.06 | 926.76 | ▇▁▁▁▁ |
| Sortierungsgrad | 0 | 1.00 | 60.11 | 19.80 | 0.00 | 44.26 | 64.34 | 76.46 | 97.48 | ▁▃▅▇▅ |
| Sort_Bio | 511 | 0.88 | 22.28 | 12.75 | 0.01 | 11.13 | 24.98 | 31.84 | 61.64 | ▅▅▇▂▁ |
| Sort_Papier | 25 | 0.99 | 10.96 | 3.88 | 0.00 | 8.66 | 10.88 | 13.06 | 45.29 | ▃▇▁▁▁ |
| Sort_Glas | 33 | 0.99 | 9.41 | 3.71 | 0.00 | 7.15 | 9.10 | 11.28 | 39.84 | ▅▇▁▁▁ |
| Sort_Holz | 1095 | 0.75 | 4.11 | 2.72 | 0.00 | 2.08 | 4.02 | 5.71 | 25.12 | ▇▃▁▁▁ |
| Sort_Metall | 246 | 0.94 | 1.76 | 1.35 | 0.00 | 0.88 | 1.54 | 2.35 | 20.67 | ▇▁▁▁▁ |
| Sort_Plastik | 39 | 0.99 | 6.11 | 3.26 | 0.00 | 4.13 | 5.79 | 7.55 | 31.60 | ▇▆▁▁▁ |
| Sort_Elektrik | 314 | 0.93 | 1.23 | 0.82 | 0.00 | 0.78 | 1.18 | 1.57 | 17.95 | ▇▁▁▁▁ |
| Sort_Textil | 1013 | 0.77 | 0.76 | 0.69 | 0.00 | 0.35 | 0.63 | 0.99 | 10.58 | ▇▁▁▁▁ |
| Sort_Rest | 136 | 0.97 | 7.94 | 5.15 | 0.03 | 3.96 | 7.13 | 11.13 | 37.16 | ▇▆▂▁▁ |
| Verwendung_Energie | 0 | 1.00 | 20.28 | 15.68 | 0.00 | 5.63 | 18.54 | 38.50 | 55.12 | ▇▃▂▇▁ |
| Verwendung_Deponie | 0 | 1.00 | 17.95 | 19.46 | 0.00 | 4.55 | 11.29 | 31.49 | 76.69 | ▇▁▂▁▁ |
| Verwendung_Recycling | 0 | 1.00 | 41.29 | 12.84 | 2.35 | 32.68 | 43.23 | 51.57 | 76.69 | ▁▅▇▇▁ |
| Verwendung_Unbekannt | 0 | 1.00 | 20.47 | 17.82 | 0.00 | 5.21 | 17.77 | 30.98 | 97.38 | ▇▅▂▁▁ |
| Steuern_gewerblich | 386 | 0.91 | 16564.97 | 14131.50 | 4179.66 | 9079.69 | 12464.90 | 19413.38 | 377492.21 | ▇▁▁▁▁ |
| Steuern_privat | 285 | 0.93 | 13210.62 | 3648.87 | 2606.01 | 10161.97 | 13667.81 | 15758.65 | 35769.54 | ▂▇▃▁▁ |
| Kosten_Basis | 0 | 1.00 | 154.24 | 76.07 | 25.69 | 108.04 | 136.62 | 179.16 | 977.42 | ▇▁▁▁▁ |
| Kosten_Sortierung | 67 | 0.98 | 52.68 | 33.06 | 3.39 | 31.25 | 48.88 | 66.44 | 582.16 | ▇▁▁▁▁ |
| Kosten_sonstiges | 52 | 0.99 | 54.18 | 43.19 | 4.27 | 27.34 | 41.69 | 66.49 | 670.32 | ▇▁▁▁▁ |
wm_df %>%
complete.cases() %>%
sum()
> [1] 2018
There are quite a lot of missing values. Omitting every incomplete row would mean losing more than 50 % of the data. Create some plots to enhance the understanding of the data set:
wm_df %>% na.omit %>%
ggplot(aes(x=Region, y=Abfaelle_gesamt)) +
ggtitle("Waste by Region") +
geom_boxplot(aes(fill = Region), outlier.shape = 2,
outlier.colour = "black",
outlier.alpha = .5) +
theme(aspect.ratio = 0.5) +
theme(axis.title.x = element_blank(),
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1),
legend.position = "none",
plot.title = element_text(hjust = 0.5))
There are a lot of outliers in the total amount of waste per municipality. I will take care of them later; until then they are hidden so that the remaining values appear less compressed.
wm_df %>% na.omit %>%
ggplot(aes(x=Region, y=Abfaelle_gesamt)) +
ggtitle("Waste by Region") +
geom_boxplot(aes(fill = Region), outlier.shape = NA) +
theme(aspect.ratio = 0.3) +
coord_flip(ylim = c(0, 29)) + # a second coord_* call would override coord_flip(); it takes ylim directly
theme(axis.title.x = element_blank(),
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1),
legend.position = "none",
plot.title = element_text(hjust = 0.5))
wm_df %>% na.omit %>%
mutate(Geologischer_Indikator =
ifelse(Geologischer_Indikator == 1, "South",
ifelse(Geologischer_Indikator == 2, "Middle",
"North"))) %>%
ggplot(aes(x=Geologischer_Indikator, y=Abfaelle_gesamt)) +
ggtitle("Waste by geological location") +
geom_boxplot(aes(fill = Geologischer_Indikator), outlier.shape = NA) +
coord_cartesian(ylim = quantile(wm_df$Abfaelle_gesamt, c(0, 0.97))) +
theme(axis.title.x = element_blank(),
axis.text.x = element_blank(),
plot.title = element_text(hjust = 0.5))
wm_df %>%
na.omit %>%
mutate(`Urbanisation deg.` =
ifelse(Urbanisierungsgrad == 1, "low",
ifelse(Urbanisierungsgrad == 2, "mid",
"high")),
Population = Bevoelkerung,
`Total waste (kt)` = Abfaelle_gesamt) %>%
ggplot(aes(text=paste0("Municipality: ", Gemeinde), y=`Total waste (kt)`)) +
ggtitle("Waste by population & urbanisation degree") +
geom_point(aes(x=Population, colour = `Urbanisation deg.`)) +
theme(plot.title = element_text(hjust = 0.5)) -> p_waste_pop_urb
ggplotly(p_waste_pop_urb)
wm_df %>% na.omit %>%
mutate(Urbanisierungsgrad = ifelse(Urbanisierungsgrad == 1, "low urbanisation deg.",
ifelse(Urbanisierungsgrad == 3, "high urbanisation deg.",
"mid urbanisation deg."))) %>%
ggplot(aes(x=Urbanisierungsgrad, y=Abfaelle_gesamt)) +
ggtitle("Mean waste by urbanisation degree") +
stat_summary(aes(group = Urbanisierungsgrad, fill = Urbanisierungsgrad), fun = mean, geom = "bar") +
labs(y = "mean(Abfaelle_gesamt)") +
theme(axis.title.x = element_blank(),
axis.text.x = element_blank(),
plot.title = element_text(hjust = 0.5))
Inspecting missing values within the waste sorting related columns:
sapply(wm_df[, 16:25],
function(y) sum(is.na(y))) %>%
tibble(Column = names(.),
NA_Count = .)
> # A tibble: 10 × 2
> Column NA_Count
> <chr> <int>
> 1 Sortierungsgrad 0
> 2 Sort_Bio 511
> 3 Sort_Papier 25
> 4 Sort_Glas 33
> 5 Sort_Holz 1095
> 6 Sort_Metall 246
> 7 Sort_Plastik 39
> 8 Sort_Elektrik 314
> 9 Sort_Textil 1013
> 10 Sort_Rest 136
options(width = 200)
Sort_NAs <- c()
for (i in 17:25) {
nas_t <- which(is.na(wm_df[i]))
Sort_NAs <- c(Sort_NAs, nas_t)
}
Sort_NAs_table <- Sort_NAs %>% table %>% sort(decreasing = T)
Sort_NAs_table[1:5] %>% names %>% as.numeric -> top_na_rows
wm_df[top_na_rows, 16:25]
> Sortierungsgrad Sort_Bio Sort_Papier Sort_Glas Sort_Holz Sort_Metall Sort_Plastik Sort_Elektrik Sort_Textil Sort_Rest
> 429 0.72 NA 0.72 NA NA NA NA NA NA NA
> 3572 0.70 NA NA NA NA NA NA 0.7 NA NA
> 4175 0.00 NA NA NA NA NA NA 0.0 NA NA
> 3777 59.32 NA NA 39.84 NA NA NA NA NA 19.48
> 4017 1.87 NA NA NA NA NA 0.62 NA NA 1.25
The sum of the values present in these waste sorting columns apparently equals the value in Sortierungsgrad. Imputing arbitrary values here would destroy the logic behind these columns. Dropping all affected rows would be a viable option; however, replacing these NAs with zeros preserves the column logic equally well. The latter option is the one I chose.
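A tiny synthetic example (made-up values in the style of the rows shown above, not the real data) illustrates why the zero replacement is safe here: the non-missing Sort_* values already sum to Sortierungsgrad, so zeros preserve that identity.

```r
# Synthetic mini example (not the real data): the non-missing Sort_* values
# sum to Sortierungsgrad, so replacing NAs with 0 preserves this identity.
df <- data.frame(
  Sortierungsgrad = c(59.32, 0.72),
  Sort_Glas       = c(39.84,   NA),
  Sort_Papier     = c(   NA, 0.72),
  Sort_Rest       = c(19.48,   NA)
)
sort_cols <- grep("^Sort_", names(df))
df[sort_cols][is.na(df[sort_cols])] <- 0               # zero imputation
all.equal(rowSums(df[sort_cols]), df$Sortierungsgrad)  # TRUE
```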
Some dimensions of the data set hold no value for any analyses: ID is a unique identifier, Gemeinde is the name of a municipality, Strassen contains more than 10 % missing values, and Region and Provinz contain so many unique values that they would complicate the process considerably while the importance of the information they hold is questionable.
cols_to_exclude <- c("ID",
"Gemeinde",
"Strassen",
"Region",
"Provinz")
A vector containing the names of these columns is created. A recipe from the recipes package (tidymodels) is used to remove these dimensions, replace the missing values in the Sort_ columns with zeros, impute the remaining missing values via bagging, cap outliers in the numeric dimensions at the IQR limits of their respective columns and convert the remaining nominal columns into dummy variables.
Note: step_impute_constant() and step_outliers_iqr_to_limits() are functions from the steffanossaR package found at https://github.com/steffanossa/steffanossaR.
recipe_prep <- recipe(~., data = wm_df) %>%
step_rm(all_of(cols_to_exclude)) %>%
step_impute_constant(contains("Sort_"), constant = 0) %>%
step_impute_bag(all_numeric()) %>%
step_outliers_iqr_to_limits(all_numeric(), -ends_with("gemeinde")) %>%
step_other(all_nominal(), threshold = .001, other = NA) %>%
step_naomit(everything(), skip = T) %>%
step_dummy(all_nominal()) %>%
step_zv(everything())
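step_outliers_iqr_to_limits() is not a standard recipes step, so as a point of reference here is a base-R sketch of what such an IQR-limit capping presumably does, assuming the usual 1.5 × IQR fences (the actual package step may differ):

```r
# Sketch of IQR-limit capping (assumption: standard 1.5 * IQR fences).
# Values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are clamped to those limits.
cap_iqr <- function(x, k = 1.5) {
  q  <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  lo <- q[1] - k * diff(q)
  hi <- q[2] + k * diff(q)
  pmin(pmax(x, lo), hi)
}
cap_iqr(c(1, 2, 3, 4, 100))  # 100 is clamped to the upper fence, 7
```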
A PCA is used to reduce the number of variables by finding principal components of the data, which are new and uncorrelated variables that can explain the most variance in the original data.
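As a quick illustration on synthetic data (not the municipal data): the component scores returned by prcomp() are pairwise uncorrelated by construction, even when the input variables are strongly correlated.

```r
# PC scores are pairwise uncorrelated, even when the inputs are correlated.
set.seed(1)
m <- cbind(a = rnorm(100), b = rnorm(100))
m <- cbind(m, c = m[, "a"] + rnorm(100, sd = 0.1))  # 'c' correlates with 'a'
pcs <- prcomp(m, center = TRUE, scale. = TRUE)$x
max(abs(cor(pcs)[upper.tri(diag(3))]))  # numerically zero
```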
First, the Bartlett test is performed. It tests the null hypothesis that the correlation matrix equals the identity matrix, i.e. that the variables of the data set are uncorrelated. If the resulting p-value is below .05, the null hypothesis is rejected and it is concluded that the variables are correlated[2].
psych::cortest.bartlett(cor(wm_df_prepped), n = 100)
> $chisq
> [1] 2562.32
>
> $p.value
> [1] 4.570546e-286
>
> $df
> [1] 465
The p-value is way below 0.05, so there is correlation between the dimensions of the data set.
Next, the Kaiser-Meyer-Olkin criterion (KMO) is looked at. The KMO measures the adequacy of a data set for factor analysis. It ranges from 0 to 1, where a higher value indicates higher suitability. A value above .6 is generally considered the threshold; however, some sources also consider .5 acceptable[3].
psych::KMO(wm_df_prepped)$MSA
> [1] 0.5683696
A KMO of 0.5683696 is not very good, but I will consider it acceptable. Now the PCA can be executed.
options(width = 90)
wm_df_pca <-
wm_df_prepped %>%
prcomp(scale. = T,
center = T)
wm_df_pca %>%
summary()
> Importance of components:
> PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
> Standard deviation 2.5701 2.3620 1.59236 1.34903 1.25906 1.15444 1.03782 1.02187
> Proportion of Variance 0.2131 0.1800 0.08179 0.05871 0.05114 0.04299 0.03474 0.03368
> Cumulative Proportion 0.2131 0.3931 0.47484 0.53355 0.58468 0.62767 0.66242 0.69610
> PC9 PC10 PC11 PC12 PC13 PC14 PC15 PC16
> Standard deviation 0.96945 0.9563 0.91464 0.88193 0.85826 0.80272 0.79652 0.76392
> Proportion of Variance 0.03032 0.0295 0.02699 0.02509 0.02376 0.02079 0.02047 0.01883
> Cumulative Proportion 0.72642 0.7559 0.78290 0.80799 0.83176 0.85254 0.87301 0.89183
> PC17 PC18 PC19 PC20 PC21 PC22 PC23 PC24
> Standard deviation 0.74174 0.6909 0.64882 0.60965 0.55892 0.54515 0.47316 0.4524
> Proportion of Variance 0.01775 0.0154 0.01358 0.01199 0.01008 0.00959 0.00722 0.0066
> Cumulative Proportion 0.90958 0.9250 0.93856 0.95055 0.96062 0.97021 0.97743 0.9840
> PC25 PC26 PC27 PC28 PC29 PC30 PC31
> Standard deviation 0.43156 0.38944 0.32427 0.17288 0.09757 0.08770 0.06888
> Proportion of Variance 0.00601 0.00489 0.00339 0.00096 0.00031 0.00025 0.00015
> Cumulative Proportion 0.99004 0.99494 0.99833 0.99929 0.99960 0.99985 1.00000
Taking a look at the first 15 principal components (PC) and the percentage of variance they explain.
wm_df_pca %>%
factoextra::fviz_eig(ncp = 15,
addlabels = T)
The elbow method would suggest using three PCs. However, the cumulative variance of only about 47.48 % explained by three PCs is too little to result in useful outcomes.
Another method of evaluating the number of PCs to keep is the Kaiser criterion, which states that factors with an eigenvalue above 1 are considered important and should be retained: an eigenvalue above 1 means the factor explains more variance than a single standardized variable does.
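The threshold of 1 follows from the eigenvalues of a correlation matrix summing to the number of variables: each standardized variable contributes exactly one unit of variance, so a component with an eigenvalue above 1 explains more than one variable's share. A quick synthetic check:

```r
# Eigenvalues of a correlation matrix sum to the number of variables,
# so an eigenvalue > 1 explains more than one variable's worth of variance.
set.seed(2)
m <- matrix(rnorm(300), ncol = 3)
sum(eigen(cor(m))$values)  # 3
```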
factoextra::get_eig(wm_df_pca) %>%
filter(eigenvalue > 1)
> eigenvalue variance.percent cumulative.variance.percent
> Dim.1 6.605264 21.307304 21.30730
> Dim.2 5.579191 17.997390 39.30469
> Dim.3 2.535613 8.179398 47.48409
> Dim.4 1.819884 5.870593 53.35469
> Dim.5 1.585226 5.113631 58.46832
> Dim.6 1.332728 4.299123 62.76744
> Dim.7 1.077079 3.474448 66.24189
> Dim.8 1.044212 3.368424 69.61031
8 factors possess eigenvalues above 1 with a cumulative variance of 69.61 %.
get_eig(wm_df_pca)[1:15,] %>%
ggplot(aes(y=eigenvalue, x=1:15)) +
ggtitle("Eigenvalue by Component") +
labs(x = "Component",
y = "Eigenvalue") +
geom_line() +
geom_point(col = "blue") +
scale_x_continuous(breaks = 1:15) +
geom_hline(yintercept = 1, linetype = "dashed") +
theme(plot.title = element_text(hjust = 0.5))
A third approach is Horn’s method. Here, random data sets of the same size (rows and columns) as the original data set are generated and a factor analysis is performed on each of them. The retained numbers of factors are then compared: if the number of factors kept for the original data set is similar to the number kept for the random sets, the factors of the original data set are considered not meaningful; if it is larger, they are considered meaningful[4].
set.seed(187)
wm_df_prepped %>%
paran::paran()
>
> Using eigendecomposition of correlation matrix.
> Computing: 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
>
>
> Results of Horn's Parallel Analysis for component retention
> 930 iterations, using the mean estimate
>
> --------------------------------------------------
> Component Adjusted Unadjusted Estimated
> Eigenvalue Eigenvalue Bias
> --------------------------------------------------
> 1 6.447778 6.605264 0.157485
> 2 5.441013 5.579191 0.138178
> 3 2.412554 2.535613 0.123058
> 4 1.709277 1.819883 0.110606
> 5 1.486338 1.585225 0.098887
> 6 1.244675 1.332728 0.088052
> --------------------------------------------------
>
> Adjusted eigenvalues > 1 indicate dimensions to retain.
> (6 components retained)
Horn’s method suggests keeping 6 PCs. I chose to keep 8, with approximately 69.61 % cumulative variance. Next, we take a look at the contributions of the original variables to each new PC.
n_PCs <- 8
for (i in 1:n_PCs) {
factoextra::fviz_contrib(wm_df_pca, "var", axes = i) %>%
print
}
The psych package comes with a function that can illustrate the contribution of each original variable to the PCs in one plot.
wm_df_prepped %>%
psych::principal(nfactors = n_PCs) %>%
psych::fa.diagram()
A new data set is created based on the new dimensions.
wm_df_transformed_pca <-
as.data.frame(-wm_df_pca$x[,1:n_PCs])
Like the PCA, a factor analysis (FA) can be used to reduce the dimensions of a data set. However, while the PCA creates uncorrelated variables, the FA identifies underlying latent factors that explain the relationships among the original variables. The factors the FA puts out might be correlated, so a rotation can be used to make them as uncorrelated as possible.
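As a toy illustration with synthetic data (one latent factor driving three made-up observed variables x1 to x3, plus an unrelated noise column), factanal() recovers the latent structure:

```r
# One latent factor drives x1-x3; 'noise' is unrelated. factanal() should
# load heavily on x1-x3 and barely on noise.
set.seed(3)
latent <- rnorm(200)
obs <- data.frame(x1 = latent + rnorm(200, sd = 0.3),
                  x2 = latent + rnorm(200, sd = 0.3),
                  x3 = latent + rnorm(200, sd = 0.3),
                  noise = rnorm(200))
fa <- factanal(obs, factors = 1, rotation = "varimax")
print(fa$loadings, cutoff = 0.3)  # x1-x3 load strongly, noise does not
```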
First, a vector containing all rotation methods is created. Then we iterate over each of them using a for-loop.
options(width = 300)
rot_meth <- c("varimax",
"quartimax",
"equamax",
"oblimin",
"promax")
max_cumvar <- 0
for (rm in rot_meth) {
cat("Factor Analysis results. Rotation method: ", rm, "\n")
res <- factanal(wm_df_prepped,
factors = n_PCs,
rotation = rm,
lower = 0.1) %>%
steffanossaR::ex_factanal()
if (res[3,n_PCs] > max_cumvar & res[3,n_PCs] <= 1) {max_cumvar <- res[3,n_PCs]}
print(res)
cat("\n")
}
> Factor Analysis results. Rotation method: varimax
> Factor1 Factor2 Factor3 Factor4 Factor5 Factor6 Factor7 Factor8
> SS loadings 5.2605001 4.2311596 2.43491023 1.72982899 1.47423168 1.3672520 1.17974164 1.16537040
> Proportion Var 0.1696936 0.1364890 0.07854549 0.05580094 0.04755586 0.0441049 0.03805618 0.03759259
> Cumulative Var 0.1696936 0.3061826 0.38472806 0.44052900 0.48808486 0.5321898 0.57024594 0.60783854
>
> Factor Analysis results. Rotation method: quartimax
> Factor1 Factor2 Factor3 Factor4 Factor5 Factor6 Factor7 Factor8
> SS loadings 5.2339896 4.2670551 3.2393682 1.63982088 1.33641019 1.31312776 1.07467596 0.7385470
> Proportion Var 0.1688384 0.1376469 0.1044957 0.05289745 0.04311001 0.04235896 0.03466697 0.0238241
> Cumulative Var 0.1688384 0.3064853 0.4109811 0.46387851 0.50698851 0.54934747 0.58401444 0.6078385
>
> Factor Analysis results. Rotation method: equamax
> Factor1 Factor2 Factor3 Factor4 Factor5 Factor6 Factor7 Factor8
> SS loadings 4.3650982 2.97044887 2.5651322 1.98388108 1.90522786 1.90121230 1.72566190 1.42633231
> Proportion Var 0.1408096 0.09582093 0.0827462 0.06399616 0.06145896 0.06132943 0.05566651 0.04601072
> Cumulative Var 0.1408096 0.23663055 0.3193767 0.38337291 0.44483188 0.50616130 0.56182782 0.60783854
>
> Factor Analysis results. Rotation method: oblimin
> Factor1 Factor2 Factor3 Factor4 Factor5 Factor6 Factor7 Factor8
> SS loadings 4.438703 3.1594732 1.9144761 1.73885944 1.48121743 1.47968744 1.35645032 1.33117830
> Proportion Var 0.143184 0.1019185 0.0617573 0.05609224 0.04778121 0.04773185 0.04375646 0.04294124
> Cumulative Var 0.143184 0.2451025 0.3068598 0.36295201 0.41073321 0.45846507 0.50222153 0.54516276
>
> Factor Analysis results. Rotation method: promax
> Factor1 Factor2 Factor3 Factor4 Factor5 Factor6 Factor7 Factor8
> SS loadings 5.0079686 3.6562588 3.6140890 1.72102808 1.47748204 1.45830140 1.44928675 0.91531511
> Proportion Var 0.1615474 0.1179438 0.1165835 0.05551703 0.04766071 0.04704198 0.04675119 0.02952629
> Cumulative Var 0.1615474 0.2794912 0.3960747 0.45159176 0.49925247 0.54629445 0.59304564 0.62257193
62.26 % is the maximum cumulative variance achieved with 8 factors, roughly 7 percentage points less than what the PCA yielded (69.61 %). Therefore the PCA will be used.
To assess the clustering tendency of a data set, the Hopkins statistic can be used. It measures the probability that a given data set was generated by a uniform data distribution; the higher the resulting value, the better the clustering tendency. Values range from 0 to 1[5].
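For reference, here is a hand-rolled base-R sketch of the Hopkins statistic (my own simplified implementation, not the one used by get_clust_tendency()), applied to synthetic data:

```r
# Hand-rolled Hopkins statistic (base-R sketch): ~0.5 for uniform data,
# near 1 for clearly clustered data.
hopkins <- function(X, m = nrow(X) %/% 10) {
  X <- as.matrix(X)
  nn_dist <- function(p, Y) min(sqrt(colSums((t(Y) - p)^2)))
  # m uniform points within the bounding box of X
  U <- sapply(seq_len(ncol(X)), function(j) runif(m, min(X[, j]), max(X[, j])))
  u <- sapply(seq_len(m), function(i) nn_dist(U[i, ], X))
  # m sampled data points, distance to the remaining data
  idx <- sample(nrow(X), m)
  w <- sapply(idx, function(i) nn_dist(X[i, ], X[-i, , drop = FALSE]))
  sum(u) / (sum(u) + sum(w))
}
set.seed(1)
blobs <- rbind(matrix(rnorm(200), ncol = 2),
               matrix(rnorm(200, mean = 8), ncol = 2))
hopkins(blobs)  # well above 0.5 for the two separated blobs
```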
#wm_df_transformed_pca %>%
# get_clust_tendency(n = nrow(wm_df_transformed_pca) - 1, graph = F)
A value of approximately 0.83 is quite good.
Clustering can be done with different approaches. Agglomerative methods start with every observation considered a single-element cluster; the two most similar clusters are then merged into a new cluster, and this is repeated until one cluster contains everything. First a distance matrix and then the clusters are created; for both steps there are multiple methods to choose from.
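A minimal base-R illustration of the two steps on synthetic points:

```r
# Two steps: a distance matrix (dist) followed by agglomerative
# clustering (hclust), here cut into two groups.
set.seed(4)
pts <- rbind(matrix(rnorm(20, mean = 0),  ncol = 2),
             matrix(rnorm(20, mean = 10), ncol = 2))
hc <- hclust(dist(pts, method = "euclidean"), method = "ward.D2")
cutree(hc, k = 2)  # the two synthetic groups are recovered
```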
dist_meth <- c("euclidean",
"maximum",
"manhattan",
"canberra",
"minkowski")
daisy_meth <- c("euclidean",
"manhattan",
"gower")
hclust_meth <- c("single",
"complete",
"average",
"mcquitty",
"median",
"centroid",
"ward")
pdf("figs/dendro_dist_hclust.pdf", width = 16, height = 9)
for (dm in dist_meth) {
for (cm in hclust_meth) {
hdist <- dist(scale(wm_df_transformed_pca),
dm)
hcl <- flashClust::flashClust(hdist, cm)
plot(hcl,
main = paste0("dist method: ", dm,
"\nclust method: ", cm),
sub = NA,
labels = FALSE)
}
}
dev.off()
> png
> 2
pdf("figs/dendro_daisy.pdf", width = 16, height = 9)
for (dm in daisy_meth) {
for (cm in hclust_meth) {
daisydist <- daisy(wm_df_transformed_pca,
metric = dm,
stand = TRUE)
flashClust::flashClust(daisydist, cm) %>%
plot(main = paste0("dist method: ", dm,
"\nclust method: ", cm),
sub = NA,
labels = FALSE)
}
}
dev.off()
> png
> 2
All possible combinations can be found in figs/dendro_dist_hclust.pdf and figs/dendro_daisy.pdf. The dist + flashClust combination of canberra and ward looks most promising.
hclust_a <-
dist(scale(wm_df_transformed_pca),
method = "canberra") %>%
flashClust::flashClust(method = "ward")
hclust_a %>% plot(main = paste0("dist method: canberra",
"\nclust method: ward"),
sub = NA,
labels = FALSE)
The number of viable clusters seems to range from 2 to 7. Microsoft Excel can be used to comfortably compare the clusters for profiling and to help determine the number of clusters to choose.
for (i in 2:7) {
wm_df_prepped_clust_a <-
wm_df_prepped %>%
mutate(Cluster_a = cutree(hclust_a, k = i))
profiling_df <-
wm_df_prepped_clust_a %>%
group_by(Cluster_a) %>%
mutate(n_island_municipalities = sum(Inselgemeinde),
Inselgemeinde = NULL,
n_coastal_municipalities = sum(Kuestengemeinde),
Kuestengemeinde = NULL) %>%
summarize_all(mean)
profiling_df$n_Cluster <- i
profiling_df$Size <- table(wm_df_prepped_clust_a$Cluster_a)
if (i == 2) {
profiling_df_final <- profiling_df
# write to csv
write.table(profiling_df_final %>%
filter(n_Cluster == i) %>%
mutate(round(., 2)),
file = "profiling_hclust_a.csv",
sep = ";",
dec = ",",
row.names = F)
} else {
profiling_df_final <-
profiling_df_final %>%
rows_append(profiling_df)
# append rows with leading blank line to csv
write.table("",
file = "profiling_hclust_a.csv",
sep = ";",
append = T,
col.names = F,
dec = ",",
row.names = F)
write.table(profiling_df_final %>%
filter(n_Cluster == i) %>%
mutate(round(., 2)),
file = "profiling_hclust_a.csv",
sep = ";",
append = T,
col.names = F,
dec = ",",
row.names = F)
}
}
Using VBA, the differences between the means of the different clusters can be emphasised. I use the following macro for highlighting:
Sub Highlighting()
Dim data_end_col As Integer
Dim cluster_group_col As Long
Dim cluster_count As Integer
Dim i As Integer
Dim j As Integer
Dim group_end_row As Integer
' Dim chr As String
' get col of cluster groups
' chr = InputBox("ColName of cluster groupings", "Text Input", "n_Cluster")
' cluster_group_col = Application.WorksheetFunction.Match(chr, Range(Cells(1, 1), Cells(1, data_end_col)), 0)
i = 2
j = 2
data_end_col = Cells(1, Columns.Count).End(xlToLeft).Column
cluster_group_col = Application.WorksheetFunction.Match("n_Cluster", Range(Cells(1, 1), Cells(1, data_end_col)), 0)
' loop through the number of clusters
Do While i <= ActiveSheet.UsedRange.Rows.Count
cluster_count = Cells(i, cluster_group_col)
group_end_row = i + cluster_count - 1
' loop through dimensions
For j = 2 To data_end_col
Range(Cells(i, j), Cells(group_end_row, j)).Select
' add highlighting
Selection.FormatConditions.AddColorScale ColorScaleType:=3
Selection.FormatConditions(Selection.FormatConditions.Count).SetFirstPriority
Selection.FormatConditions(1).ColorScaleCriteria(1).Type = _
xlConditionValueLowestValue
With Selection.FormatConditions(1).ColorScaleCriteria(1).FormatColor
.Color = 13011546
.TintAndShade = 0
End With
Selection.FormatConditions(1).ColorScaleCriteria(2).Type = _
xlConditionValuePercentile
Selection.FormatConditions(1).ColorScaleCriteria(2).Value = 50
With Selection.FormatConditions(1).ColorScaleCriteria(2).FormatColor
.Color = 16776444
.TintAndShade = 0
End With
Selection.FormatConditions(1).ColorScaleCriteria(3).Type = _
xlConditionValueHighestValue
With Selection.FormatConditions(1).ColorScaleCriteria(3).FormatColor
.Color = 7039480
.TintAndShade = 0
End With
Next j
i = group_end_row + 2
Loop
End Sub
Unlike the agglomerative approach, the divisive method begins with a single cluster containing every observation; with each step, existing clusters are divided into smaller ones until there are as many clusters as observations.
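A toy run of cluster::diana() on synthetic points (hypothetical example data, not the PCA scores) illustrates the interface; the divisive coefficient dc is close to 1 for well-separated groups:

```r
library(cluster)

# Two well-separated synthetic groups; diana() should split them first.
set.seed(5)
pts <- rbind(matrix(rnorm(20, mean = 0),  ncol = 2),
             matrix(rnorm(20, mean = 10), ncol = 2))
dv <- diana(pts, metric = "euclidean", stand = TRUE)
dv$dc                         # divisive coefficient, near 1 here
cutree(as.hclust(dv), k = 2)  # cutting the divisive tree into two groups
```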
diana_res <- diana(wm_df_transformed_pca, metric = "euclidean", stand = TRUE)
pltree(diana_res)
to be continued
[1] Municipal waste is all waste collected and treated by or for municipalities. It includes waste from households, including bulky waste, similar waste from trade and commerce, office buildings, institutions and small businesses, as well as yard and garden waste, street sweepings and the contents of waste containers. The definition excludes waste from municipal sewage systems and their treatment, as well as waste from construction and demolition work.
[2] Reference: https://www.itl.nist.gov/div898/handbook/eda/section3/eda357.htm
[3] Reference: https://www.empirical-methods.hslu.ch/entscheidbaum/interdependenzanalyse/reduktion-der-variablen/faktoranalyse/
[4] Reference: doi:10.1007/bf02289447
[5] Reference: https://www.datanovia.com/en/lessons/assessing-clustering-tendency/